# Using GPUs for Signal Correlation



#### Michael Clark

with Paul La Plante and Lincoln Greenhill

### Outline

- Motivation
- Mapping the X-engine onto a GPU
  - Or how to get a sustained TFLOP from a Fermi
- Comparison
- Summary and Conclusions

#### **Dipole Array Signal Processing**

#### Heterogeneous HPC solution



Thursday, 27 January 2011

### Correlator

• Builds the cross-power spectrum of the sky

$$S_{ij}(\nu) = \int_{-\infty}^{\infty} (A_i \star A_j)(\tau) e^{-i2\pi\nu\tau} d\tau$$

- Cross-correlate the signal from every station pair
- XF Correlator
  - Cross correlate the signal then FFT
- FX Correlator
  - From convolution theorem  $\mathcal{F}(A \star B) = (\mathcal{F}A) \times (\mathcal{F}B)$
  - FFT the signal then cross-multiply  $O(B N_s (N_s + k))$

### FX Correlator

(Figure taken from Casper collaboration)



### F-engine compute scales linearly with N<sub>s</sub>

X-engine compute scale quadratically with N<sub>s</sub>





#### X-Engine Computing Requirements $N_c = 1000, B = 100 \text{ MHz}$



## FPGAs

- FPGAs (or ASICs) generally used for correlation
- Ideal since only require limited fixed-point ops
- Very power efficient since all die area devoted to problem
- Extremely expensive
  - Astronomers survive on donations and old h/w
- Development time can be very long
- Local experts at Berkeley

http://casper.berkeley.edu/wiki/CASPER\_Correlator



Cuda Programming Guide

#### **CUDA GPU Roadmap**



Huang's GTC Keynote

#### Are GPUs (part of) the solution?

### Motivation

- GPUs are an interesting platform
  - Harness commodity hardware
  - Low cost
  - Easy to develop
  - Forwards compatible
- How competitive are GPUs?
  - Vs. other commodity hardware
  - Vs. FPGAs
  - Testcase: MWA Correlator
- Recall Nbody with GRAPE

## Previous Work

- Harris et al (2008)
  - GPU for X-engine only no consideration for integrated system
- Wayth et al (2009)
  - Consider GPU for both F-engine and X-engine
  - Developed for the 32 station MWA prototype
  - Not a scalable solution
- van Nieuwpoort *et al* (2009)
  - Compared X-engine performance across range of platforms
  - Claim: GPU implementations suffer from memory bottlenecks
  - Conclusion: BG/P and Cell optimal
- All of the above had low percentage of peak performance for GPUs



# MWA Correlator

- Array located in Western Australian Outback
  - Total power budget ~ 50 kW
- Tasked with detecting EOR signal
- 512 stations x 2 polarizations = 1024 inputs
- total bandwidth = 31 MHz
- X-engine requires
  - $1.6 \times 10^{13}$  CMACs = 128 TFLOPS
- How many GPUs required?



# Fermi Architecture

GeForce GTX 480

- 480 processing cores
- 1.345 TFLOPS SP



- 177 GB/s memory bandwidth
- Power consumption 250 Watts
- Easy to program
- \$300 (as of yesterday)



# Fermi Architecture

- High memory bandwidth
- BUT high flop:byte ratio 7.6:1
- Need high arithmetic intensity to keep GPU busy
- Use memory hierarchy
  - Registers
  - Shared memory
  - Device memory
  - PCle bus



# The X-engine

• Mathematically, just the sum of a series of vector outer products

$$S_{ij}(\nu) = \sum_{t=1}^{N_t} X_i(\nu) X_j(\nu)^{\dagger} \quad \text{FLOPS} = \frac{1}{2} 8B(2N_s)(2N_s+1)$$

- Nt=B/T is the integration length, e.g. ~100000
- Matrix is Hermitian just calculate lower triangular elements









#### Station









#### Registers

Each thread

computes

a IxI tile

Matrix elements stored in registers

flop/byte Algorithm: I Hardware: 7.6

#### Cache

Data shared between threads using LI cache

#### Output

Each thread outputs its IxI tile to memory

flop/byte Algorithm: I/N<sub>t</sub> Hardware: 7.6











shared memory



Ist warp reads row











shared memory

















Ist warp compute













2nd warp compute

shared memory









#### Registers



flop/byte Algorithm: I Hardware: I.5

#### **Shared Memory**

Each thread block loads a 16x16 tile

> flop/byte Algorithm: 16 Hardware: 7.6



shared memory

2nd warp reads column



Ist warp reads row


#### Registers

Matrix elements stored in registers

Each thread computes a 2x2 tile

flop/byte Algorithm: 2 Hardware: 7.6

#### Cache

Data shared between threads using L1 cache



#### Registers

Each thread

computes

a IxI tile



flop/byte Algorithm: 2 Hardware: 1.5

#### **Shared Memory**

Each thread block loads a 16x16 tile

> flop/byte Algorithm: 16 Hardware: 7.6



2nd warp reads column



Ist warp reads row



#### Thursday, 27 January 2011



### Performance for $N_s$ =512, $N_c$ =12, N<sub>t</sub>=1000







# Feeding the Beast

- I TFLOP sustained is all very well, but what if the bus can't keep up?
- Algorithm profile
  - Host  $\rightarrow$  Device  $O(N_c N_s B_c)$
  - Kernel  $O(N_c N_s^2 B_c)$
  - Device  $\rightarrow$  Host  $O(N_c N_s^2)$
- Kernel will dominate at large  $N_s$  but what about small  $N_s$ ?
- Previous work showed X-engine limited by bus
  - ~250 GFLOPS at  $N_s = 64$  (van Nieuwpoort *et al*)

### Download time versus Kernel execution time



# The beast eats a lot, but each mouthful is small...

- PCIe bus is a severe constraint to performance
  - $N_s \ge 256$  kernel dominates
- Input signal precision typically 4-5 bits
- FP32 is a complete waste of bandwidth
- Accumulated correlation matrix must be high precision (FPGAs typically use > 12-bit precision)
- Use 8-bit precision for input, keep matrix FP32
  - 8-bit natively supported through textures

### Download time versus Kernel execution time









### Host

### Device Buffer Compute





### Host

### Device Buffer Compute

#### Sustained X-engine performance

128 frequency channels, 1024 time samples



Bandwidth achievable per signal





### Multi-GPU

- Trivial to parallelize across frequency
  - PCIe bus contention manageable
- 4 TFLOPs sustained with 4 GPUs
  - 1300 Watts total
    - 3 GFLOPS/Watt (cf 1684 GFLOPS/Watt #1 Green 500)
  - \$6K for workstation => \$1.50 per GFLOP

### X-engine Performance Across Platforms

| Architecture                | GFLOPS | GOPS/Watt<br>per chip (total) |  |  |
|-----------------------------|--------|-------------------------------|--|--|
| Intel Core i7 (quad)*       | 48.0   | 0.4                           |  |  |
| IBM BG/P*                   | 13.1   | 0.5                           |  |  |
| IBM Cell*                   | 187    | 2.7                           |  |  |
| Nvidia C1060*               | 243    | Ι                             |  |  |
| Nvidia GTX 480 <sup>†</sup> | 1042   | ~4 (3)                        |  |  |

\* van Nieuwpoort and Romein, <sup>†</sup>this work, <sup>#</sup>de Souza et al and Lonsdale et al, <sup>+</sup>Manley

### X-engine Performance Across Platforms

| Architecture                        | GFLOPS | GOPS/Watt<br>per chip (total) |  |  |
|-------------------------------------|--------|-------------------------------|--|--|
| Intel Core i7 (quad)*               | 48.0   | 0.4                           |  |  |
| IBM BG/P*                           | 13.1   | 0.5                           |  |  |
| IBM Cell*                           | 187    | 2.7                           |  |  |
| Nvidia C1060*                       | 243    | Ι                             |  |  |
| Nvidia GTX 480 <sup>†</sup>         | 1042   | ~4 (3)                        |  |  |
| MWA Correlator #<br>(Virtex 4 SX35) |        | (~10)                         |  |  |
| Roach II <sup>+</sup><br>(Virtex 6) |        | (~58)                         |  |  |

\* van Nieuwpoort and Romein, <sup>†</sup>this work, <sup>#</sup>de Souza et al and Lonsdale et al, <sup>+</sup>Manley





### FPGA F-engine 4 kW I 28x GPU X-engine 42 kW

64x GPU 30 kW

76 kW total





# Hybrid Correlator

• Proposed correlator for LEDA



### Scaling to the Future



# Summary and Conclusion

- X-engine is a perfect match to the GPU
  - 78% peak performance
- Low cost and development time
  - Easy to keep with bleeding edge
- Not (yet) power competitive with FPGAs
- Future: hybrid correlators?
- Combine X-engine with Calibration and Imaging

### **Optimization** lessons

- Shared memory AND register tiling critical
- Minimize integer arithmetic
  - Precalculate offsets
  - Texture lookup
- Thread synchronization is costly
- The compiler doesn't always know what it's doing








































### Don't trust compilers

• Compare these "identical" code fragments

a += b\*c + d\*c + e\*f + g\*h;

a += b\*c;

770 GFLOPS

| a | += | d*c; |             |
|---|----|------|-------------|
| a | += | e*f; | 1020 GFLOPS |
| a | += | g*h; |             |